Abstract

Citrus, as one of the globally important fruit trees, has been an object of interest for
understanding genetics and evolutionary processes in fruit crops. Meta-analyses of 19 Citrus
species, including 4 globally and economically important Citrus sinensis, Citrus clementina,
Citrus reticulata, and 1 Citrus relative Poncirus trifoliata, were performed. We observed that
codons ending with A or T at the wobble position were preferred over C- or G-ending codons,
indicating a close association with the AT richness of Citrus species and P. trifoliata. The
present study postulates a large repertoire of optimal codons for the Citrus genus and
P. trifoliata and demonstrates that GCT and GGT are evolutionarily conserved optimal codons.
Our observations suggest that mutational bias is the dominant force shaping codon usage bias
(CUB) in Citrus and P. trifoliata. Correspondence analysis (COA) revealed that the principal
axis [axis 1; COA/relative synonymous codon usage (RSCU)] contributes only a minor portion
(~10.96%) of the recorded variance. In all analysed species, except P. trifoliata, Gravy and
aromaticity played minor roles in resolving CUB. Compositional constraints were found to be
strongly associated with the amino acid signatures in Citrus species and P. trifoliata. Our
present analysis postulates compositional constraints in Citrus species and P. trifoliata and
a plausible role of stress in relation to GC3 and the co-evolution pattern of amino acids.
1. Introduction

Genome composition (such as GC- and AT-content) and the subsequent balance of codon usage
and eukaryotic translation machinery play an important role in the evolution of the
nucleotides at the wobble position. 1 Degeneracy of the genetic code explains why every amino
acid except methionine (Met) and tryptophan (Trp) is encoded by multiple synonymous
codons. 2 Genome composition bias leads to the usage of some codons at a higher frequency
than their synonymous alternatives for encoding a particular amino acid. These differences in
the usage of synonymous codons have also been one of the factors in the evolution of proteome
diversity and can help us to understand the evolution of proteins that have structural
differences in spite of being conserved at the sequence level. 3-7

Two major paradigms of codon usage were proposed as plausible answers to clarify the
non-randomness of codon usage at intra- and inter-species levels: (i) natural selection that
is expected to yield a correlation with codon bias in highly expressed genes, 8 rapidly
regulated and variably expressed genes, and (ii) neutral processes, such as mutational biases
(MBs), where some mutations occur more often than others across the genome of an organism
because of local variations in the base composition. Several lines of evidence support the
argument that the usage of synonymous codons with unequal frequencies in both prokaryotic and
eukaryotic genes is the result of a complex balance between MB and/or natural selection and
genetic drift. 9-12 It has been suggested that codon bias could be positively selected
because it allows more efficient and accurate translation, and favoured codons may correspond
to the most highly expressed genes. 2,13 In addition to neutral and selection processes,
GC-biased gene conversion, which depends on the local recombination rate, is an important
factor in shaping codon and amino acid usage. 14

To date, wide variations have been observed in codon usage patterns in many organisms and
have provided clues to understand the evolution of genes and gene families. The factors that
potentially affect biased usage of codons are MB, which correlates the codon usage bias (CUB)
with the genomic GC content; the Hill-Robertson effect, which explains the interference of
selection at one locus with selection at another locus; translational selection; and a
cumulative effect of replication and translational selection in shaping the codon usage
across the genes of several bacterial species. 15-19 It has been shown that CUB and protein
functional conservation play a major role in the decelerated evolution of whole genome
duplication in Saccharomyces cerevisiae. Several other factors that might affect codon usage
are amino acid conservation and hydrophobicity, 21 gene expression, 22 mRNA folding
stability, codon-anticodon interaction, and gene length. 22

Citrus is a diploid genus of the Rutaceae family, whose cultivated forms are important for
the human diet, with more than 122 MT of annual world fruit production. We have
systematically analysed codon usage patterns and genomic heterogeneity using a meta-analysis
approach that involves multivariate tools and codon usage indices such as relative synonymous
codon usage (RSCU) and effective number of codons (Nc) in the studied gene sets. 23,24 In the
present study, we inferred the global pattern of CUB, mutational pressures, and the
association of GC3 with potential stress events. We have collectively analysed
horticulturally important Citrus species, including the recently sequenced double haploid
Citrus sinensis genome 25 and 18 additional Citrus species with expressed sequence tag (EST)
counts of more than 1000 per species. To make this analysis comparative, we have also
included Poncirus trifoliata, which is considered to be a distant relative of the Citrus
genus. These species show wide variation in traits such as cold and drought tolerance and are
of commercial importance. We have restricted our analyses to Citrus genera to develop
resource information for Citrus species.

Our study demonstrated that CUB in Citrus species and P. trifoliata is biased towards AT
richness, and a relatively higher occurrence of A- and T-ending codons was observed. We found
several interesting and diverse patterns of optimal codons, and it was observed that two
optimal codons coding for Alanine and Glycine were evolutionarily conserved between the
Citrus species and P. trifoliata. To the best of our knowledge, our analysis presents the
first complete report on the identification of optimal codons across the entire Citrus genus
and P. trifoliata, which could serve as a potential resource for developing transgenic Citrus
cultivars using codon optimization. GC3 and evolution were correlated using the Hamming
distance parameter to identify suggestive evolutionary pairs of genes. The co-orthologous
gene matrix identified using the alignment ratio suggested a close association of Citrus
clementina, Citrus sinensis, and Citrus reticulata, as they belong to the same phylogenetic
clade. We further demonstrated that nucleotide bias has a genome-wide influence on the amino
acid composition of Citrus species and P. trifoliata.

2. Materials and methodology 

2.1. Sequence information and processing

Our dataset consists of genome-predicted coding regions and ESTs. All the genome-predicted
coding sequences of C. sinensis were retrieved from the recently sequenced Citrus genome. 25
For the other Citrus species, ESTs were downloaded from the National Center for Biotechnology
Information (NCBI) EST repository (NCBI; http://www.ncbi.nlm.nih.gov). In addition, we
downloaded putative unique transcripts for all studied species from the PlantGDB database
(http://www.plantgdb.org). A detailed description of the data used is shown in Table 1. The
ESTs were first clustered into contigs and singletons using CAP3 with default parameters. 26
A minimum match percentage cutoff of 95% for 40 overlapping bases was used to assign 2
sequences to a cluster.

2.2. Frame correction of ESTs 

All the UniGenes (contigs + singletons) were analysed for frame correction and prediction of
protein-coding regions using FrameDP. 27,28 Briefly, the following pipeline was implemented
using FrameDP to identify open reading frames (ORFs): firstly, each EST was compared against
the TAIR database (Arabidopsis Information Resource; http://www.arabidopsis.org/) using
BLASTX 29 with E-value = 10^-4, identity
Table 1. Genomic composition of coding region of Citrus species and P. trifoliata at GC, GC1, GC2, GC3, and GC3s

No.  Citrus species                Genes/EST a  Unigenes count  Genes b  GC     GC1    GC2    GC3    GC3s
1    C. sinensis                   44 275       -               41 362   43.92  50.4   40.33  41.03  38.8
2    C. aurantifolia               8219         7550            4715     47.78  52.96  43.87  46.52  44.73
3    C. aurantium                  14 584       11 952          6787     46.93  52.47  42.59  45.72  43.94
4    C. clementina                 118 365      51 591          33 765   47.04  51.84  42.51  46.76  44.99
5    C. clementina x C. tangerina  1843         1283            677      44.66  51.67  39.77  42.55  40.47
6    C. jambhiri                   1017         858             701      45.86  51.79  41.13  44.66  42.67
7    C. japonica var. margarita    2924         1628            588      46.11  52.67  41.42  44.25  42.35
8    C. limettioides               8188         7933            3964     46.47  50.76  42.75  45.9   44.25
9    C. limonia                    11 045       9761            4256     45.82  51.13  41.37  44.96  43.17
10   C. medica                     1115         896             711      45.65  51.91  40.9   44.13  42.15
11   C. reshni                     5768         3735            2644     45.2   51.51  40.57  43.52  41.5
12   C. reticulata                 55 980       46 170          30 816   46.85  52.3   42.77  45.48  43.62
13   C. reticulata x C. temple     5823         3685            2253     46.48  52.41  41.64  45.41  43.41
14   C. sinensis x P. trifoliata   1837         1522            992      46.07  51.67  41.41  45.15  43.23
15   C. sunki                      5216         4688            1746     47.58  51.51  43.58  47.64  46.17
16   P. trifoliata                 62 695       35 740          19 445   46.78  52.33  42.67  45.35  43.52
17   C. unshiu                     19 072       9289            4592     45.67  51.72  41.14  44.15  42.21
18   C. paradisi                   8039         4621            2517     45.32  52.19  40.51  43.27  41.3
19   C. paradisi x P. trifoliata   7954         3335            2596     45.23  51.96  40.77  42.98  40.96

Means of GC% were calculated at the first, second, and third positions. GC3s represents the GC at the third synonymous
position.

a In C. sinensis, genome-predicted coding regions are used, whereas for the rest of the species, the number represents the EST
count in the column (genes/EST).
b Selected genes above 300 bp threshold.


percent (%) = 40% over 100 amino acids. Secondly, the training dataset was generated from the
BLASTX results, and, subsequently, the training matrix was calculated, which represents the
coding style of the species. Thirdly, a collection of putative protein-coding sequences
(CDSs) was generated for each Citrus species and P. trifoliata based on its homology with a
known protein dataset and on the coding style recognition matrix.

2.3. Sequence filtering and GC variation 

From the set of corrected sequences, we discarded proteins shorter than 100 amino acids to
create a reliable dataset for all the studied Citrus species and P. trifoliata for further
analysis. 23 The final sequence dataset was subsequently analysed by tabulating the frequency
of GC at the first, second, and third codon positions and at third synonymous positions (GC1,
GC2, GC3, and GC3s, respectively) using in-house written Perl and C++ scripts. GC3 is defined
as the fraction of cytosines (C) and guanines (G) in the third position of the codon:
GC3 = 3(C3 + G3)/L for an ORF of length L, whereas GC3s is defined as the G + C base
composition at the third synonymously degenerate position of codons. To define GC3-rich and
GC3-poor groups, we selected the 5% of genes with the highest and the lowest GC3 values. For
the genes in the GC3-rich and GC3-poor groups, we computed two measures: (i) the positional
gradient of GC3 and (ii) the CG3 skew. The CG3 skew is the difference between the fractions
of cytosines (C) and guanines (G) in the third position of the codon divided by the sum of C
and G in the third position: CG3 skew = (C3 - G3)/(C3 + G3). The positional gradient of GC3
is defined as [G3(x) + C3(x)]/Nseq, where x is the distance measured as the number of codons
from the first ATG, and Nseq is the number of sequences.
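The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the in-house Perl and C++ scripts used in the study: the function names `gc_by_position` and `gc3_gradient` are hypothetical, sequences are assumed to be in frame and to start at the first ATG, and sequences shorter than position x are kept in the denominator Nseq.

```python
def gc_by_position(cds):
    """Return GC1, GC2, GC3 (fractions) and CG3 skew for one in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence shorter than one codon")
    gc = [0, 0, 0]                        # G+C counts at codon positions 1, 2, 3
    c3 = g3 = 0
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        for pos, base in enumerate(codon):
            if base in "GC":
                gc[pos] += 1
        if codon[2] == "C":
            c3 += 1
        elif codon[2] == "G":
            g3 += 1
    # GC3 = 3(C3 + G3)/L is the same as (C3 + G3) / number of codons
    gc1, gc2, gc3 = (count / n_codons for count in gc)
    skew = (c3 - g3) / (c3 + g3) if (c3 + g3) else 0.0
    return {"GC1": gc1, "GC2": gc2, "GC3": gc3, "CG3_skew": skew}


def gc3_gradient(seqs, n_codons):
    """[G3(x) + C3(x)] / Nseq for codon positions x = 0 .. n_codons - 1."""
    profile = []
    for x in range(n_codons):
        idx = 3 * x + 2                   # third base of codon x
        hits = sum(1 for s in seqs if len(s) > idx and s[idx].upper() in "GC")
        profile.append(hits / len(seqs))
    return profile
```

For a set of GC3-rich or GC3-poor genes, `gc3_gradient` would be called once per group and the two profiles compared along the coding region.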

2.4. Indices of codon usage and correspondence 
analysis 

The effective number of codons (Nc) provides an independent measure of CUB, regardless of
the gene length. 23 The expected Nc values were computed according to the equation proposed
by Wright, which assumes equal use of G and C (A and T) in degenerate codon groups. 23

Nc = 2 + s + 29/[s^2 + (1 - s)^2], where s = GC3s.
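As a small worked illustration of Wright's expected-Nc relation, the curve can be evaluated directly from GC3s. The function name `expected_nc` is an assumption made for this sketch; GC3s is taken as a fraction between 0 and 1.

```python
def expected_nc(gc3s):
    """Expected Nc under compositional constraint only (Wright's relation).

    gc3s: G+C fraction at third synonymous positions, in [0, 1].
    """
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    s = gc3s
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


# Points for drawing the expected curve of an Nc plot
curve = [(s / 100, expected_nc(s / 100)) for s in range(0, 101, 5)]
```

Genes whose observed Nc falls well below this curve at their GC3s value are the ones for which factors other than base composition may be contributing to the bias.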
We have further analysed RSCU using multivariate analysis for all the 59 informative codons
(excluding Met, Trp, and the three stop codons). 30,31 As proposed earlier, if RSCU values
are close to 1.0, it indicates that all the synonymous codons are used equally without any
bias towards the usage of a particular codon in a gene of length L. In our study, axis 1
(COA/RSCU) and axis 2 (COA/RSCU) represent the first and second major axes of correspondence
analysis. The Kyte-Doolittle scale was used to calculate the hydropathy score that is the
arithmetic mean of the sum of hydropathic indices of each amino acid. This scale also
provides information on transmembrane or surface regions. 32
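The RSCU index described above can be illustrated as follows. This is a minimal sketch assuming the standard nuclear genetic code and the usual definition of RSCU (observed codon count divided by the count expected if all synonymous codons of that amino acid were used equally); the study itself used CodonW, and the names `CODON_TABLE` and `rscu` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for the 59 informative codons of one in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group synonymous codons, skipping Met, Trp, and stop codons
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "MW*":
            families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue                      # amino acid absent: RSCU undefined
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values
```

A value of 1.0 means the codon is used exactly as often as expected under equal usage; values above 1.0 indicate a preferred codon.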

2.5. Identification of optimal codons 

Optimal codons are believed to achieve faster translation rates and higher accuracy; their
effect is therefore more pronounced in highly expressed genes. 33 For the identification of
optimal codons, we used 10% of the total genes from each extreme end of the principal axis of
correspondence analysis (axis 1; COA/RSCU). Codon usage was then compared between the two
groups using a χ2 contingency test (χ2), and codons whose frequency of usage was
significantly higher at three different levels of statistical significance (P-value < 0.05;
P-value < 0.01; P-value < 0.001) in highly expressed genes when compared with lowly expressed
genes were defined as the optimal codons. We also identified the genes related to the
ribosomal proteins and observed the association of the ribosomal proteins with axis 1
(COA/RSCU) and axis 2 (COA/RSCU) of the correspondence analysis to optimize the
identification of the optimal codons.
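The contingency comparison can be sketched as a 2 x 2 test per codon: counts of the codon versus its synonymous alternatives, in the high-expression group versus the low-expression group. This is an illustrative reconstruction, not the CodonW implementation; the function names are hypothetical, no continuity correction is applied, and the thresholds are the standard χ2 critical values for one degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0                        # degenerate table, no evidence
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Critical values of chi-square with 1 degree of freedom
CRITICAL = {0.05: 3.841, 0.01: 6.635, 0.001: 10.828}


def is_optimal(codon_high, other_high, codon_low, other_low, alpha=0.01):
    """True if the codon is used significantly more often in the high group.

    codon_*: count of the codon in each gene group
    other_*: summed count of its synonymous alternatives in that group
    """
    if codon_high + other_high == 0 or codon_low + other_low == 0:
        return False
    freq_high = codon_high / (codon_high + other_high)
    freq_low = codon_low / (codon_low + other_low)
    stat = chi_square_2x2(codon_high, other_high, codon_low, other_low)
    return freq_high > freq_low and stat > CRITICAL[alpha]
```

Running `is_optimal` at each of the three alpha levels gives the graded significance used to mark codons in a star map.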

2.6. GO and stress-associated annotation, Hamming 
parameter, and coevolution of amino acid 
composition 

Stress-related genes were identified as reciprocal best bidirectional hits (RBH) using the
NCBI BLASTP program with E-value 10^-30, and gene ontology (GO) annotation of the model dicot
plant Arabidopsis thaliana was then transferred using a 'guilt-by-association' approach. To
optimize our annotation pipeline, we selected GO annotations in the 'involved in' category
containing the word 'stress' and/or 'response' and grouped them together as 'stress related'.
The normalized Hamming distance was then evaluated according to the method described in
Sablok et al. 34 All the frame-corrected coding regions and the genome-predicted coding
regions were parsed. The resulting proteins were used to identify co-orthologous genes, and a
suggestive phylogeny was drawn using the co-orthologous matrix as described in Lechner
et al. 35 To identify the effect of nucleotide bias on amino acid composition, we partitioned
the codons according to GC-rich (the so-called GARP amino acids: Glycine, Alanine, Arginine,
and Proline) and AT-rich (the FYMINK amino acids: Phenylalanine, Tyrosine, Methionine,
Isoleucine, Asparagine, and Lysine) amino acids. 36,37 We have maintained the exclusion of
Leucine and Arginine from the dataset as per Singer and Hickey. 37
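The GARP/FYMINK partition can be illustrated by computing the fraction of each class in a protein sequence. This is a simple sketch, not the authors' script: the function name `garp_fymink_fractions` is hypothetical, and any exclusions applied in the study are not modelled here.

```python
GARP = set("GARP")        # Gly, Ala, Arg, Pro: encoded by GC-rich codons
FYMINK = set("FYMINK")    # Phe, Tyr, Met, Ile, Asn, Lys: encoded by AT-rich codons


def garp_fymink_fractions(protein):
    """Return (GARP fraction, FYMINK fraction) of a one-letter protein sequence."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty protein sequence")
    n = len(protein)
    garp = sum(aa in GARP for aa in protein) / n
    fymink = sum(aa in FYMINK for aa in protein) / n
    return garp, fymink
```

Plotting these two fractions against the GC content of the corresponding coding sequences is one way to see whether nucleotide bias is reflected in amino acid usage.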

2.7. Statistical analysis 

All the indices of codon usage were calculated using CodonW (http://codonw.sourceforge.net),
and several in-house developed Perl, R, and C++ scripts were written to streamline the
downstream analyses. Statistical analyses were performed using R in RStudio
(http://rstudio.org/). All the results were interpreted based on the non-parametric
Spearman's rank correlation (ρ).

3. Results and discussion 

3.1. Patterns of genomic content and codon usage
variation across Citrus species and P. trifoliata 

The recently sequenced nuclear genome of C. sinensis 25 presents an opportunity to analyse
the nucleotide compositional pressure and heterogenetic variation, which could possibly help
us to understand the molecular adaptation of these horticulturally important fruit species.
We have used a meta-analysis approach using the predicted genomic coding regions of the
recently sequenced C. sinensis and frame-corrected ESTs of related species and P. trifoliata.
Nucleotide composition analysis indicated that Citrus species are AT rich (~46% GC, typical
for many sequenced genomes of dicot species), and this bias towards AT richness is dominant
and is observed across all the studied species (Table 1).

Based on the above observation, we could hypothesize that these species are biased towards
the A- and/or T-ending codons in the coding region across these two genera. Generally, most
of the analysed dicot species prefer to use A- or T-ending codons at the third position; this
observation is in accordance with a previous study in Citrus using a small dataset of 177 CDS
regions. 38 We found that average GC3 is significantly lower than GC1 and the observed GC
(P-value < 0.05), which suggests that the genome composition is biased towards AT usage at
the third, 'wobble' position.

In eukaryotes, the intra-genomic heterogeneity is high, and interspecific variation of the
average GC content is low. 39,40 In Citrus species, GC usage varies by position, but with
much greater variance, with higher usage in position 1 (~51.8% GC) and lower GC usage in
positions 2 and 3 (~41.6 versus ~44.7%, and ~42.8% at the third synonymous position, GC3s).
The observed results are consistent with previous reports demonstrating a preference for G in
position 1 and T/A in positions 2 and 3 41 and demonstrate the role of mutational pressure in
the evolution of CUB across Citrus species and P. trifoliata. Furthermore, there is
significant wide variation in GC usage at the third synonymous position (GC3s ~42.8%)
(Table 1; Fig. 1). In our analysis, we estimated codon usage patterns of frame-corrected
coding regions obtained from ESTs versus gene predictions derived from a fully assembled and
annotated genome.

To look for biased association between the nucleotide content and codon usage, we plotted Nc
against GC3s, also described as the Nc plot (Fig. 2), which has been widely used to evaluate
codon usage variation among genes, as this codon usage index has a definite relationship
with GC3 (more compositionally biased DNA is expected to encode a smaller subset of
codons). 23,42 Wright 23 argued that comparing the actual distribution of genes with the
distribution expected under no selection could indicate whether the CUB of genes is subject
to influences other than compositional constraints. A significant correlation was found
between Nc and GC3s (0.466** C. sinensis; -0.025** Citrus aurantifolia; 0.070** Citrus
aurantium; 0.125** C. clementina; 0.183** C. clementina x Citrus tangerina; 0.184** Citrus
jambhiri; 0.179** Citrus japonica var. margarita; -0.048** Citrus limettioides; -0.031**
Citrus limonia; 0.093** Citrus medica; 0.216** Citrus reshni; 0.061** C. reticulata; 0.101**
C. reticulata x Citrus temple; 0.119** C. sinensis x P. trifoliata; -0.213** Citrus sunki;
0.061** P. trifoliata; 0.139** Citrus unshiu; 0.160** Citrus paradisi; 0.212** C. paradisi x
P. trifoliata; **P-value < 0.01).

We noted that most genes tend to lie below the standard trajectory path and are poised
towards GC3s, which indicates that MB is acting as a major factor for the wide variation in
codon usage across the Citrus species and P. trifoliata (Fig. 2). However, there might be
additional factors that influence codon usage across these species. This dependence of codon
usage on genome base composition (AT- or GC-richness) has been previously reported in
several unicellular genomes. 43,44 In a genome-wide analysis of eubacterial and archaeal
genomes, it has been suggested that genome-wide variance in codon usage is primarily due to
MB as the GC content shows wide variation along the isochores. 17 At the same time,
alternative views associate GC variability with various factors such as transcriptional
optimization, methylation, recombination, and horizontal gene transfer. 45 It can be inferred
that the cloning of orthologs and homologues conserved across the Citrus species and
P. trifoliata that are AT rich and have low Nc values will require few degenerate primers. On
the contrary, genes that are GC rich and having high GC values will require more degenerate
primers for enhancement of cloning efficiency.
Figure 1. The distribution of GC3 content in Citrus species, P. trifoliata and A. thaliana genes. The GC3 content showed unimodal
distribution.
Figure 2. Nc versus GC3s plot of Citrus species and P. trifoliata genes. The solid black line indicates the expected Nc value, if the codon bias is only due to GC3s. Species order: A = C.
sinensis; B = C. aurantifolia; C = C. aurantium; D = C. clementina; E = C. clementina x C. tangerina; F = C. jambhiri; G = C. japonica var. margarita; H = C. limettioides; I = C. limonia;
J = C. medica; K = C. reshni; L = C. reticulata; M = C. reticulata x C. temple; N = C. sinensis x P. trifoliata; O = C. sunki; P = P. trifoliata; Q = C. unshiu; R = C. paradisi; and S = C.
paradisi x P. trifoliata.
3.2. Correspondence analysis

It has been reported that there is significant heterogeneity within genes and
genomes. 33,46 A heat map was constructed using the observed RSCU values for all the 59
informative codons (excluding the Met, Trp, and the stop codons) using the average linkage
clustering method (Fig. 3). The heat map and the supporting values (Supplementary Table 1)
show that the global codon usage is biased towards AT richness. We have partitioned the
genes based on high and low GC3s content to see the global pattern of deviation across the
two principal axes displaying the major variation in accordance with the synonymous GC3s
(Supplementary Fig. 1).

Correspondence analysis showed that relative inertia tends to decrease along the axes, and
in all the studied species, it was observed that axis 1 (COA/RSCU) contributed the major
portion of relative inertia, indicating that the major trend of codon usage variation is
associated with axis 1 (10.75 C. sinensis; MA 5 C. aurantifolia; 12.27 C. aurantium; 14.54
C. clementina; 6.65 C. clementina x C. tangerina; 11.62 C. jambhiri; 7.01 C. japonica var.
margarita; 10.41 C. limettioides; 14.66 C. limonia; 11.24 C. medica; 10.67 C. reshni; 12.54
C. reticulata; 8.66 C. reticulata x C. temple; 10.85 C. sinensis x P. trifoliata; 13.61
C. sunki; 12.06 P. trifoliata; 9.11 C. unshiu; 8.72 C. paradisi; 10.75 C. paradisi x
P. trifoliata). A highly significant correlation (positive or negative) was also observed
between axis 1 and GC3s (0.860**, C. sinensis; 0.888**, C. aurantifolia; 0.887**,
C. aurantium; 0.903**, -0.708** C. clementina; -0.836**, C. clementina x C. tangerina;
-0.772**, C. jambhiri; -0.831**, C. japonica var. margarita; -0.803**, C. limettioides;
0.849**, C. limonia; 0.894**, C. medica; -0.855**, C. reshni; 0.894**, C. reticulata;
0.846**, C. reticulata x C. temple; -0.884**, C. sinensis x P. trifoliata; 0.806**,
C. sunki; 0.879**, P. trifoliata; 0.834**, C. unshiu; -0.848**, C. paradisi; -0.879**,
C. paradisi x P. trifoliata; **P-value < 0.01). The observed results indicate dominance of
MB in the Citrus species and P. trifoliata and suggest that the variation in the usage of
synonymous codons among the genes in Citrus species and P. trifoliata is largely a biased
representation of the nucleotide content of the genes.

3.3. Identification of optimal codons in Citrus species 
and P. trifoliata 
Earlier and recent reports suggest that the usage of optimal codons and balanced codon usage
enhances the efficiency of translation by increasing the translation rate of the preferred
codons over the other synonymous codon choices. 47,48 Genes using optimal codons have a
higher translation rate than genes using non-optimal codons, which in turn increases
ribosome usage efficiency and potentially reduces ribosome drop-off. 22,49,50 ESTs
constitute a partial representation of the transcriptome that is correlated with gene
abundance and expression. 51,52 In 14 Citrus species (C. aurantifolia; C. aurantium;
C. clementina; C. jambhiri; C. limettioides; C. limonia; C. medica; C. reshni; C. reticulata;
C. reticulata x C. temple; C. sinensis x P. trifoliata; C. unshiu; C. paradisi; C. paradisi x
P. trifoliata) and




Figure 3. Heat map of the average RSCU of the 59 degenerate codons in the Citrus species and P. trifoliata using Euclidean distance and 
average linkage clustering module. 
P. trifoliata, we extracted EST counts, as a measure of expression, from the CAP3 assembly
(.ace files) and compared the usage of codons in accordance with earlier reports. 53-55 We
also selected the ribosomal protein-encoding genes and quantified the association of the
ribosomal encoding genes with axis 1 (COA/RSCU) and axis 2 (COA/RSCU) of correspondence
analysis.

A star map showing optimal codons was constructed taking 10% of the genes from the extreme
tails of multivariate analysis (Table 2), and several distinct trends of optimal codons were
detected. In an earlier study, 38 optimal codons in Citrus were estimated based on the
correspondence analysis of codon usage, and the relative frequency of synonymous codon using
177 CDS and A- or T-ending optimal codons (TAA, GCT, GAT, CTT, AGG, AGA, and GTT) was
observed. Overall, we observed that the trend of optimal codon usage was not conserved
across all analysed species. However, in our analyses, we have observed that GCT (~1.57;
RSCU) and GGT (~1.09; RSCU), representing Alanine and Glycine, were evolutionarily conserved
across all the species. Both of these optimal codons are T-ending, which indicates a dominant
role of MB in the conservation of A- or T-ending codons and potentially reflects the biased
genome composition. However, for the other observed optimal codons, genomic composition was
not able to explain the deviation in pattern. We further observed that for most amino acids
with 2- to 6-fold degeneracy level, there has been a general preference for the usage of two
or more codons as optimal codons. For example, in Glycine, two highly distributed optimal
codons GGA and GGT were identified, and they could be classified as the primary and secondary
optimal codons based on the RSCU (GGA, ~1.16 and GGT, ~1.09). A curious trend observed among
the Citrus species and P. trifoliata is that Leucine is frequently encoded optimally by a
G-ending codon (TTG; ~RSCU: 1.49) in four species (C. sinensis, C. clementina, C. reticulata,
and P. trifoliata) rather than a synonymous A- or T-ending codon. However, in other species,
CTT (~RSCU: 1.49) and CTC (~RSCU: 0.89) (TTA, RSCU: 1.16; CTT RSCU: 1.51) were the optimal
codons.

We found deviations in the usage of optimal codons encoding Lysine. For example, Lysine is
frequently encoded by AAA (~RSCU: 0.98) as optimal codon in three species (C. sinensis,
C. reticulata, and P. trifoliata), whereas AAG (~RSCU: 1.01) was found to be the optimal
codon for Lysine in the rest of the Citrus species. In Arginine, we observed that in two
species (C. aurantifolia and C. paradisi x P. trifoliata), AGG represents the potential
optimal codon instead of the synonymous AGA. But based on the RSCU values, AGA (~RSCU: 1.66)
was assumed to be more dominant over the AGG codon (~RSCU: 1.52) at the respective levels of
significance P-value < 0.01 and P-value < 0.05. Using a mutation and selection model, Knight
et al. 56 demonstrated that 'pairs of species with convergent GC content might also evolve
convergent protein sequences, especially at functionally unconstrained positions'. For
instance, the frequencies of both Lysine and Arginine are highly anti-correlated with GC
content, and Lysine and Arginine can easily be substituted for one another in proteins. This
observation explains the inclusion of Lysine and Arginine in our study and is well supported
by an earlier study in nematodes. 57 It has been proposed that Arginine and Leucine have a
tendency to show different codon usage patterns because of the prevalence of the synonymous
GC substitutions in the first and the third codon position. 58 Recently, it has been
postulated that codon optimization significantly enhanced MIR gene expression in Solanum
lycopersicum cv. Microtom, which potentially depicts the importance and the usage of the
optimal codons in gene expression and transgenics. To our knowledge, this is the first
large-scale identification of optimal codons in Citrus species and P. trifoliata, which could
serve as a model repertoire to enhance the transformation efficiency.

3.4. Role of other selective constraints on codon bias in 
Citrus species and P. trifoliata 

A recent study has revealed that MB deeply influences the folding stability of proteins,
making proteins on average less hydrophobic and, therefore, less stable with respect to
unfolding and also less susceptible to misfolding and aggregation. 60 To identify the
potential effects of Gravy and aromaticity, we computed non-parametric correlation
coefficients between axis 1 (COA/RSCU) and Gravy or aromaticity scores for all studied
species. We observed that Gravy and aromaticity played a minor role in shaping the variation
of codon usage across Citrus species and P. trifoliata. In the case of P. trifoliata, no
significant correlation was observed for aromaticity, suggesting that aromaticity has no
significant role in shaping the codon usage variation in this genus.
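The two protein-level indices used here can be illustrated with a short sketch. It assumes the published Kyte-Doolittle hydropathy values, defines Gravy as their mean over the sequence, and defines aromaticity as the fraction of Phe, Tyr, and Trp residues; the function names are hypothetical, and residues outside the 20 standard amino acids are skipped in the Gravy calculation.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Mean Kyte-Doolittle hydropathy of a one-letter protein sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not scores:
        raise ValueError("no standard amino acids in sequence")
    return sum(scores) / len(scores)


def aromaticity(protein):
    """Fraction of aromatic residues (Phe, Tyr, Trp) in the sequence."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty protein sequence")
    return sum(aa in "FYW" for aa in protein) / len(protein)
```

Per-gene values from these two functions would then be rank-correlated with the axis 1 coordinates of the correspondence analysis.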

3.5. Variations in GC 3 across Citrus species and P. 
trifoliata 

Recent reports suggest that GC3 composition and GC gradient are acting as major factors
along the orientation of transcription in monocots. It has also been suggested that GC
composition is vital for understanding chromatin remodelling, gene expression, and
recombination. 45,61 Some studies failed to depict a GC gradient along the genes of dicot
plants using A. thaliana as a model. 61 The reason for this failure is that
Codon/AA ABC
TTT/Phe *
TTC/Phe * *
TTA/Leu *
TTG/Leu * *
CTT/Leu * * *
CTC/Leu * *
CTA/Leu *
CTG/Leu
ATT/Ile *
ATC/Ile * *
ATA/Ile *
GTT/Val * * *
GTC/Val * *
GTA/Val *
GTG/Val
TCT/Ser * * *
TCC/Ser * *
TCA/Ser * * *
TCG/Ser
AGT/Ser *
AGC/Ser
CCT/Pro * * *
CCC/Pro
CCA/Pro * * *
CCG/Pro
ACT/Thr * * *
ACC/Thr * *
ACA/Thr *
ACG/Thr
GCT/Ala * * *
GCC/Ala * *
GCA/Ala *
GCG/Ala
TAT/Tyr *
TAC/Tyr * *
CAT/His * *
CAC/His *
CAA/Gln * *
CAG/Gln *
AAT/Asn * * *
AAA/Lys *
AAG/Lys * *
GAT/Asp * * *
GAA/Glu * *
GAG/Glu *
TGT/Cys *
Table 2. Continued

Codon/AA A B C D E F G H I J K L M N O P
TGC/Cys * * * * * * * * * *
CGT/Arg ******** ** ***
CGC/Arg ****** * ** ***
CGA/Arg * * * * * *
CGG/Arg
AGA/Arg ***** * *
AGG/Arg * *
GGT/Gly ************* ***
GGC/Gly * * * * *
GGA/Gly * ** ** ** * *

Species are in the following order: A = C. sinensis; B = C. aurantiifolia; C = C. aurantium; D = C. clementina; E = C. jambhiri;
F = C. limettioides; G = C. limonia; H = C. medica; I = C. reshni; J = C. reticulata; K = C. reticulata x C. temple; L = C. sinensis x
P. trifoliata; M = P. trifoliata; N = C. unshiu; O = C. paradisi; and P = C. paradisi x P. trifoliata. AA represents amino acid.



high- and low-GC3 genes have opposite gradients along
the genes, and when high- and low-GC3 genes are
pooled together, the effect disappears. In Citrus
species and P. trifoliata, the distribution of GC3 is
unimodal and bell-shaped, centred at 0.39, which is
considered typical of dicot species.
We selected the top and bottom 5% of genes by GC3 across
each studied Citrus species and P. trifoliata, and strikingly
distinct gradients of GC3 and CG3 skew for high-
and low-GC3 genes were revealed. Genes with high
GC3 showed a higher level of GC3 codons in their
middle coding regions than in terminal coding
regions. In addition, high-GC3 genes have a preference
for C over G in the middle coding regions, and this
preference was not found in the 3'-coding end,
~100 bp in size (Fig. 4; Supplementary Fig. 2).
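
The quantities discussed above can be illustrated with a short sketch. It assumes that GC3 is the fraction of codons with G or C at the third position and that CG3 skew is (C3 - G3)/(C3 + G3); the windowing scheme (non-overlapping windows counted in codons from the 5' end) and all function names are hypothetical choices made for illustration only.

```python
def third_positions(cds):
    """Third-position bases of every complete codon in a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return [cds[i + 2] for i in range(0, usable, 3)]


def gc3(cds):
    """Fraction of codons whose third base is G or C."""
    thirds = third_positions(cds)
    if not thirds:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in thirds) / len(thirds)


def cg3_skew(cds):
    """(C3 - G3) / (C3 + G3); assumed definition, 0.0 if no G or C at position 3."""
    thirds = third_positions(cds)
    c3, g3 = thirds.count("C"), thirds.count("G")
    return (c3 - g3) / (c3 + g3) if c3 + g3 else 0.0


def gc3_gradient(cds, window=50):
    """GC3 in consecutive non-overlapping windows of `window` codons (5' to 3')."""
    thirds = third_positions(cds)
    profile = []
    for start in range(0, len(thirds) - window + 1, window):
        chunk = thirds[start:start + window]
        profile.append(sum(base in "GC" for base in chunk) / window)
    return profile
```

Averaging `gc3_gradient` over the genes of the GC3-rich and GC3-poor groups separately would give profiles comparable in spirit to those in Fig. 4.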

It was previously reported that stress-related genes
in grasses are GC3-rich. 45 We stratified all Citrus
species and P. trifoliata genes into three groups by
GC3: rich (top 5%), poor (bottom 5%), and medium
(middle 90%). We observed that among the four
major economically important species, in C.
reticulata (1718), C. clementina (1522), and
P. trifoliata (951), most of the GC3-rich genes were
stress-related, relative to the total number
of observed stress-related genes, whereas in
C. sinensis a high number of
medium-GC3 genes (1857) were also
stress-related. The relative abundance of
medium-GC3 genes in C. sinensis may be due to
the adaptability of this species over the
course of evolution, suggesting adaptive evolution
towards stress.

A high abundance of stress-related genes in C. reticulata
and P. trifoliata may be closely related to their
high stress tolerance, especially in P. trifoliata,
which has the highest cold resistance among
Citrus species and their wild relatives. The occurrence
of high-GC3 genes in all these species suggests
an association between DNA methylation and GC3.
It has previously been suggested that high GC3
composition is influenced by the GC mismatch-repair
mechanism that is dominant in stress-associated
genes; the absence of this positive bias
may lead to the loss of the recombination repair
mechanism and could be detrimental to plant
adaptation under evolving stress conditions.
Genomic regions under higher selective pressure
recombine more frequently, and as a result a relative
increase in GC3 content can be observed. 45 As shown
in Supplementary Table 2 and Fig. 5, in all Citrus
species and P. trifoliata, the ratio of stress-related to
non-stress-related genes in the GC3-rich group was elevated
in comparison to the GC3-medium group and
depleted in the GC3-poor group. It is worth noting
that the GC3-poor group has fewer genes functionally
described as stress-related than the
GC3-rich group.
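
The stratification and ratio described above can be sketched as follows, assuming per-gene GC3 values and a set of stress-annotated gene identifiers are already available. The 5% tails follow the text; the function names, the rounding of the tail size, and the input structures are hypothetical.

```python
def stratify_by_gc3(gc3_by_gene, tail=0.05):
    """Split genes into GC3-poor (bottom tail), medium, and GC3-rich (top tail)."""
    ranked = sorted(gc3_by_gene, key=gc3_by_gene.get)
    k = max(1, round(len(ranked) * tail))  # number of genes in each tail
    return {
        "poor": ranked[:k],
        "medium": ranked[k:-k],
        "rich": ranked[-k:],
    }


def stress_ratio(genes, stress_genes):
    """Ratio of stress-related to non-stress-related genes within a group."""
    stress = sum(gene in stress_genes for gene in genes)
    other = len(genes) - stress
    return stress / other if other else float("inf")
```

Comparing `stress_ratio` across the three groups returned by `stratify_by_gc3` corresponds to the comparison summarized in Fig. 5, under the assumptions stated above.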

In all the studied species, GC3-rich and GC3-poor genes
show different trends from the 5' to the 3' end of the genes:
GC3-rich genes become even more GC3 rich, and
GC3-poor genes become more GC3 poor. Firstly, the
GC3-rich group has more stress-related genes than
the GC3-poor group. Secondly, the GC3-rich group
has a positive gradient of GC3 from the 5' and 3' flanks
to the middle portion of the CDS, whereas the GC3-poor
group has a negative gradient of GC3 from the 5' and 3'
flanks to the middle portion of the CDS. Thirdly,
GC3-rich and GC3-poor groups showed different CG3
skew trends along the CDS: GC3-rich genes favour C3
over G3, and this preference is most pronounced in
the first 150 codons, whereas GC3-poor genes favour G3



Figure 4. GC3 gradient across Citrus species and P. trifoliata. Species order: A = C. sinensis; B = C. aurantiifolia; C = C. aurantium; D = C. clementina; E = C. clementina x C. tangerina; F = C.
jambhiri; G = C. japonica var. margarita; H = C. limettioides; I = C. limonia; J = C. medica; K = C. reshni; L = C. reticulata; M = C. reticulata x C. temple; N = C. sinensis x P. trifoliata;
O = C. sunki; P = P. trifoliata; Q = C. unshiu; R = C. paradisi; and S = C. paradisi x P. trifoliata. These plots show gradients of GC3 for 5% of GC3-rich and GC3-poor genes for all analysed
genomes. Gradients from 5' and 3' ends of the CDS are shown in the same plot, separated by a vertical line. For all genomes, GC3-rich genes become more GC3 rich towards the
middle of the CDS.



146 



Codon Evolution in Citrus species and Poncirus trifoliata 



[Vol. 20, 



[Figure 5 panels (bar charts of gene counts per species): High GC3 and stress-related genes; Medium GC3 and stress-related genes; Low GC3 and stress-related genes.]

Figure 5. Distribution of stress-related genes in Citrus species and P. trifoliata according to the GO of Arabidopsis. 



over C3 (Supplementary Fig. 2). It was suggested
that transcriptional and translational optimization
of stress-related and tissue-specific genes is the
major force responsible for maintaining high GC3
content. 46 The replacement of the AT pair with the
GC pair at the third codon position enhances
transcriptional activity, which in turn enhances the array of
ribosomal-binding proteins. Stayssman et al. 62
reported a strong positive correlation between the
methylation of internal unmethylated regions and
expression of the host gene and postulated that genes
with high GC3 provide more targets for methylation.
Genes involved in the response to various stresses need
to produce protein at a faster rate in response
to external stimuli. This results in a shortened
transcript length and a preference for G and C (and
especially C) in the third position of the codon (to
avoid abortive transcription and ribosome congestion).
In addition, high-GC3 genes have more methylation
targets that allow for fine-tuning of transcriptional
regulation. 58



3.6. Evolution and coevolution of nucleotide and amino
acid composition

We analysed the relationship between amino acid
sequence divergence and variation in GC3 to test
the effect of evolutionary divergence on codon
usage. We observed an overall positive
correlation (0.43) between relative change in GC3
and normalized Hamming distance (Fig. 6). This was
expected because more diverse amino acid sequences
are likely to have more diverse nucleotide sequences.
The trend is not uniform: for genes with GC3 in
the range between 0.5 and 0.6, the correlation is
0.596, and it drops to 0.32 for GC3 above 0.8 or
below 0.4. Some pairs of organisms
have a negative correlation between GC3 and
Hamming distance (C. jambhiri: C. limettioides, C.
limonia: C. sinensis x P. trifoliata, C. aurantiifolia: C.
jambhiri, C. jambhiri: C. paradisi, C. aurantiifolia: C.
paradisi, C. jambhiri: P. trifoliata, C. medica: C.
sinensis x P. trifoliata, C. jambhiri: C. unshiu, C. jambhiri:
C. reshni, C. japonica var. margarita: C. paradisi,
Figure 6. Hamming distance versus change in GC3 visualization across Citrus species and P. trifoliata. The relative difference in GC3 composition
between genes A and B, defined as [GC3(A) - GC3(B)]/[GC3(A) + GC3(B)], is positively correlated with the Hamming distance between
the corresponding protein sequences, calculated as [I(AB) + I(BA)]/[I(AA) + I(BB)], where I(AB) is the number of identities
in the alignment of protein A to protein B. The resulting Hamming distance was rounded to 1 DP, and an average difference in GC3 was
computed for each category.



C. aurantiifolia: C. clementina x C. tangerina, C.
aurantium: C. jambhiri, C. clementina x C. tangerina: C.
jambhiri). Note that in the list of negative correlations, C.
jambhiri occurs eight times.
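
The comparison summarized in Fig. 6 can be sketched as follows, using the quantities given in the figure caption. Two points are assumptions made for illustration: the magnitude (absolute value) of the relative GC3 difference is averaged so that the result does not depend on which gene is labelled A, and the distance passed to the binning step is whatever normalized Hamming distance has been computed upstream. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def relative_gc3_difference(gc3_a, gc3_b):
    """[GC3(A) - GC3(B)] / [GC3(A) + GC3(B)], as in the Fig. 6 caption."""
    return (gc3_a - gc3_b) / (gc3_a + gc3_b)


def identity_ratio(i_ab, i_ba, i_aa, i_bb):
    """[I(AB) + I(BA)] / [I(AA) + I(BB)] from alignment identity counts."""
    return (i_ab + i_ba) / (i_aa + i_bb)


def mean_difference_by_distance(pairs):
    """Average |relative GC3 difference| per distance rounded to 1 DP.

    `pairs` is an iterable of (distance, gc3_a, gc3_b) for orthologous genes.
    Taking the absolute value is an assumption, not stated in the source.
    """
    bins = defaultdict(list)
    for distance, gc3_a, gc3_b in pairs:
        bins[round(distance, 1)].append(abs(relative_gc3_difference(gc3_a, gc3_b)))
    return {d: mean(values) for d, values in sorted(bins.items())}
```

The binned averages returned by `mean_difference_by_distance` are the kind of per-category values that a plot such as Fig. 6 would display.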

Genes that have low GC3 content in C. jambhiri
(<0.4) tend to have a negative correlation between
difference in GC3 and Hamming distance, and genes
with high GC3 content (>0.6) are slightly positively
correlated. This atypical behaviour can be explained
by the larger evolutionary distance between C. jambhiri
and the other species analysed in the present
investigation. The highest correlation was observed between
C. japonica var. margarita and C. limonia (ρ = 0.86,
number of orthologous pairs N = 12, t-statistic t =
5.22, P-value = 0.0002). C. japonica var. margarita
and C. limonia are evolutionarily distant, and the
number of orthologous pairs is only 12. Hence,
the high value of the correlation coefficient may be a
result of an evolutionary pressure to conserve codon
usage for a selected subset of conserved proteins.
We observed that pairs of species having a positive
correlation between GC3 and Hamming distance are
enriched in genes involved in translation, transport,
proteolysis, photorespiration, and photosynthesis;
pairs of species with a negative correlation are enriched
in stress-response genes.
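
As a quick check of the statistics quoted above, the t-statistic for a correlation coefficient can be computed with the standard approximation t = r * sqrt((N - 2)/(1 - r^2)). Whether the authors used exactly this formula is an assumption; the function name is hypothetical.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t-statistic for a correlation coefficient r observed on n pairs."""
    if n < 3:
        raise ValueError("at least three pairs are required")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * sqrt((n - 2) / (1 - r * r))


# With the rounded values quoted in the text (coefficient 0.86, N = 12)
# this gives a value of about 5.33, close to the reported t = 5.22.
t_value = correlation_t_statistic(0.86, 12)
```

The small gap between the two values is consistent with the coefficient having been rounded to two decimal places before being reported, although the text does not confirm this.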



Similar patterns of hierarchical clustering were
obtained when we clustered the species based on
co-orthologous genes using genome-predicted and
frame-corrected reconstructed proteins of the Citrus
species and P. trifoliata. The phylogenetic clades
were rerooted using P. trifoliata as an outgroup, and
it was observed that C. sinensis, C. clementina, and C.
reticulata all belong to the same clade, which is in
accordance with previous reports using major
intrinsic proteins (XIP subfamily of aquaporins) and
further supports the conserved homology between
these species. 63 These observations suggest
that the identified frame-corrected coding regions
are trustworthy and accurate enough to be used for
predictions in species for which no genome information
is available so far (Supplementary Fig. 3A and B).

Because a nucleotide bias can lead to an overall bias
in the amino acid composition of proteins, it is possible
that a genome with nucleotide bias may have
introduced atypical amino acid substitutions in its
proteome. Hence, AT-rich coding sequences would encode
proteins rich in FYMINK amino acids (Phenylalanine,
Tyrosine, Methionine, Isoleucine, Asparagine, and
Lysine), whereas GC-rich coding sequences would
produce proteins containing high levels of GARP
amino acids (Glycine, Alanine, Arginine, and Proline).
Because Citrus species and P. trifoliata are AT-rich, we
found a higher proportion of FYMINK [FYMINK and
CDS GC (y = 4.1431x + 22.341, R² = 0.9099); GARP
and CDS GC (y = 3.0687x + 45.414, R² = 0.7803);
P-value < 10⁻⁹].

Because the third synonymous position does not
influence the protein sequence, we computed the
correlation coefficient between the FYMINK and GARP
parameters and GC3. This correlation is an
important factor in describing nucleotide bias at
synonymous and non-synonymous sites. We observed a
highly significant correlation between nucleotide
composition at the third synonymous position and amino
acid composition [FYMINK and CDS GC3 (y =
10.862x + 142.4, R² = 0.9296); GARP and CDS GC3
(y = 9.842x + 212.15, R² = 0.7495)], suggesting
that nucleotide bias has an influence on amino acid
composition in Citrus species and P. trifoliata.
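
The regressions reported above relate amino acid group content to nucleotide composition. A minimal sketch of that kind of analysis is given below, assuming FYMINK and GARP content are expressed as percentages of residues and that an ordinary least-squares line with its R² is fitted; the function names and input units are hypothetical and may differ from those used by the authors.

```python
FYMINK = "FYMINK"  # Phe, Tyr, Met, Ile, Asn, Lys
GARP = "GARP"      # Gly, Ala, Arg, Pro


def composition_percent(protein, residues):
    """Percentage of residues in `protein` that belong to `residues`."""
    protein = protein.upper()
    return 100.0 * sum(protein.count(aa) for aa in residues) / len(protein)


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot
```

Calling `linear_fit` with per-species GC (or GC3) values as `x` and the matching FYMINK or GARP percentages as `y` yields a slope, intercept, and R² of the kind quoted in the text.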

In summary, we found a prevalence of A- or T-ending
codons, with an exception for Lysine and Arginine in P.
trifoliata. We suggest that although the patterns of
optimal codons were not conserved, two codons
(GCT and GGT) were found to be conserved across
all the Citrus species and P. trifoliata. We analysed
GC3-rich and GC3-poor genes and their association with
stress, and our results provided novel insights into
the evolution of stress adaptation in Citrus species
and P. trifoliata in accordance with GC3 biology.
This research is critical for all Citrus species, as it
might facilitate understanding of genome dynamics
and evolution in Citrus and P. trifoliata, transformation
of target genes of interest, the design of multitargeting
gene systems, and ultimately the genetic
improvement of these important fruit crops.

Acknowledgements: The authors thank Dr Kenta 
Nakai (the handling editor) and the two anonymous 
reviewers for helpful and constructive comments 
and Dr Yuepeng Han, Wuhan Botanic Garden (CAS) 
for critical reading of the manuscript. G.S. thanks the 
computational support provided from Research and 
Innovation Center, CRI-FEM, IASMA, Italy. T.V.T. 
thanks the Research Investment Scheme, University 
of Glamorgan, UK. 

Supplementary data: Supplementary Data are 
available at www.dnaresearch.oxfordjournals.org. 



Funding 

This research was financially supported by the
Ministry of Science and Technology of China (nos.
2011AA100205, 2011CB100606), the National
NSF of China, and the Ministry of Agriculture of
China (no. 200903044).



References 

1. Serres-Giardi, L., Belkhir, K., David, J. and Glemin, S.
2012, Patterns and evolution of nucleotide landscapes
in seed plants, Plant Cell, 24, 1379-97.

2. Hershberg, R. and Petrov, D.A. 2009, General rules for
optimal codon choice, PLoS Genet., 5, e1000556.

3. Grantham, R., Gautier, C., Gouy, M., Jacobzone, M. and
Mercier, R. 1981, Codon catalog usage is a genome
strategy modulated for gene expressivity, Nucleic Acids
Res., 9, 43-74.

4. Akashi, H. 2001, Gene expression and molecular
evolution, Curr. Opin. Genet. Dev., 11, 660-6.

5. Aragones, L., Guix, S., Ribes, E., Bosch, A. and Pinto, R.M.
2010, Fine-tuning translation kinetics selection as the
driving force of codon usage bias in the hepatitis A
virus capsid, PLoS Pathol., 6, e1000797.

6. Medigue, C., Rouxel, T., Vigier, P., Henaut, A. and
Danchin, A. 1991, Evidence for horizontal gene transfer
in Escherichia coli speciation, J. Mol. Biol., 222, 851-6.

7. Pascal, G., Médigue, C. and Danchin, A. 2005, Universal
biases in protein composition of model prokaryotes,
Proteins, 60, 27-35.

8. Duret, L. 2002, Evolution of synonymous codon usage
in metazoans, Curr. Opin. Genet. Dev., 12, 640-9.

9. Bulmer, M. 1991, The selection-mutation drift theory of
synonymous codon usage, Genetics, 129, 897-907.

10. Sharp, P.M. and Matassi, G. 1994, Codon usage and
genome evolution, Curr. Opin. Genet. Dev., 4, 851-60.

11. Akashi, H. and Eyre-Walker, A. 1998, Translational
selection and molecular evolution, Curr. Opin. Genet.
Dev., 8, 688-93.

12. Gupta, S.K. and Ghosh, T.C. 2001, Gene expressivity is
the main factor in dictating the codon usage variation
among the genes in Pseudomonas aeruginosa, Gene,
273, 63-70.

13. Hershberg, R. and Petrov, D.A. 2008, Selection on codon
bias, Annu. Rev. Genet., 42, 287-99.

14. Harrison, R.J. and Charlesworth, B. 2011, Biased gene
conversion affects patterns of codon usage and amino
acid usage in the Saccharomyces sensu stricto group of
yeasts, Mol. Biol. Evol., 28, 117-29.

15. Hill, W.G. and Robertson, A. 1966, The effect of linkage
on limits to artificial selection, Genet. Res., 8, 269-94.

16. McInerney, J.O. 1998, Replicational and transcriptional
selection on codon usage in Borrelia burgdorferi, Proc.
Natl. Acad. Sci. USA, 95, 10698-703.

17. Chen, S.L., Lee, W., Hottes, A.K., Shapiro, L. and
McAdams, H.H. 2004, Codon usage between genomes
is constrained by genome-wide mutational processes,
Proc. Natl. Acad. Sci. USA, 101, 3480-85.

18. Qin, H., Biao, W.W., Comeron, J.M., Kreitman, M. and
Li, W.H. 2004, Intragenic spatial patterns of codon
usage bias in prokaryotic and eukaryotic genomes,
Genetics, 168, 2245-60.

19. Stoletzki, N. and Eyre-Walker, A. 2007, Synonymous
codon usage in Escherichia coli: selection for
translational accuracy, Mol. Biol. Evol., 24, 374-81.

20. Lin, Y.S., Byrnes, J.K., Hwang, J.K. and Li, W.H. 2006,
Codon-usage bias versus gene conversion in the
evolution of yeast duplicate genes, Proc. Natl. Acad. Sci.
USA, 103, 14412-16.

No. 2] T. Ahmad et al. 149

21. Zhou, T., Sun, X. and Lu, Z. 2006, Synonymous codon
usage in environmental Chlamydia UWE25 reflects an
evolutional divergence from pathogenic Chlamydiae,
Gene, 368, 117-25.

22. Sharp, P.M., Emery, L.R. and Zeng, K. 2010, Forces that
influence the evolution of codon bias, Philos.
Trans. R. Soc. Lond. B Biol. Sci., 365, 1203-12.

23. Wright, F. 1990, The 'effective number of codons' used
in a gene, Gene, 87, 23-29.

24. Wan, X.F., Xu, D., Kleinhofs, A. and Zhou, J. 2004,
Quantitative relationship between synonymous bias
and GC composition across unicellular genomes, BMC
Evol. Biol., 4, 19.

25. Xu, Q., Chen, L.L., Ruan, X., et al. 2012, The draft
genome of sweet orange (Citrus sinensis), Nat. Genet.,
doi:10.1038/ng.2472.

26. Huang, X. and Madan, A. 1999, CAP3: a DNA sequence
assembly program, Genome Res., 9, 868-77.

27. Gouzy, J., Carrere, S. and Schiex, T. 2009, FrameDP:
sensitive peptide detection on noisy matured sequences,
Bioinformatics, 25, 670-71.

28. Schiex, T., Gouzy, J., Moisan, A. and de Oliveira, Y. 2003,
FrameD: a flexible program for quality check and gene
prediction in prokaryotic genomes and noisy matured
eukaryotic sequences, Nucleic Acids Res., 31, 373-81.

29. Altschul, S.F., Madden, T.L., Schaffer, A.A., et al. 1997,
Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs, Nucleic Acids Res.,
25, 3389-402.

30. Greenacre, M.J. 1984, Theory and Application of
Correspondence Analysis. Academic Press: London, 223
pp.

31. Sharp, P.M. and Li, W.H. 1986, An evolutionary
perspective on synonymous codon usage in unicellular
organisms, J. Mol. Biol., 24, 28-38.

32. Kyte, J. and Doolittle, R. 1982, A simple method for
displaying the hydropathic character of a protein, J. Mol.
Biol., 157, 105-32.

33. Ikemura, T. 1985, Codon usage and tRNA content in
unicellular and multicellular organisms, Mol. Biol.
Evol., 2, 13-34.

34. Sablok, G., Nayak, K., Vazquez, F. and Tatarinova, T.V.
2011, Synonymous codon usage, GC3 and evolutionary
patterns across plastomes of three pooid model species
- emerging grass genome models for monocots, Mol.
Biotechnol., 49, 116-28.

35. Lechner, M., Findeiss, S., Steiner, L., Marz, M., Stadler, P.F.
and Prohaska, S.J. 2011, Proteinortho: detection of
(co-)orthologs in large-scale analysis, BMC Bioinformatics,
12, 124.

36. Foster, P.G., Jermiin, L.S. and Hickey, D.A. 1997,
Nucleotide composition bias affects amino acid
content in proteins coded by animal mitochondria,
J. Mol. Evol., 44, 282-88.

37. Singer, G.A.C. and Hickey, D.A. 2000, Nucleotide bias
causes a genomewide bias in the amino acid
composition of proteins, Mol. Biol. Evol., 17, 1581-88.

38. Hu, G.B., Zhang, S.L., Xu, C.J. and Lin, S.Q. 2006, Analysis
of codon usage in Citrus, J. Fruit Sci., 23, 479-85.



39. Sueoka, N. 1964, On the evolution of informational
macromolecules. In: Bryson, V. and Vogel, H.J. (eds.),
Evolving Genes and Proteins. Academic Press: New York,
pp. 479-96.

40. Sueoka, N. and Kawanishi, Y. 2000, DNA G+C content
of the third codon position and codon usage biases of
human genes, Gene, 261, 53-62.

41. Sueoka, N. 1988, Directional mutation pressure and
neutral molecular evolution, Proc. Natl. Acad. Sci. USA,
85, 2653-57.

42. Wang, H.C. and Hickey, D.A. 2007, Rapid divergence of
codon usage patterns within the rice genome, BMC
Evol. Biol., 7, Suppl. 1, S6.

43. Gupta, S.K., Bhattacharyya, T.K. and Ghosh, T.C. 2004,
Synonymous codon usage in Lactococcus lactis:
mutational bias versus translational selection, J. Biomol.
Struct. Dyn., 21, 527-36.

44. Wright, F. and Bibb, M.J. 1992, Codon usage in the
G+C-rich Streptomyces genome, Gene, 113, 55-65.

45. Tatarinova, T., Alexandrov, N., Bouck, J. and Feldmann, K.
2010, GC3 biology in corn, rice, sorghum and other
grasses, BMC Genomics, 11, 308.

46. Sharp, P.M., Cowe, E., Higgins, D.G., Shields, D.C.,
Wolfe, K.H. and Wright, F. 1988, Codon usage in
Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae,
Schizosaccharomyces pombe, Drosophila melanogaster
and Homo sapiens; a review of the considerable
within-species diversity, Nucleic Acids Res., 16,
8207-11.

47. Andersson, S.G.E. and Kurland, C.G. 1990, Codon
preferences in free-living microorganisms, Microbiol. Rev.,
54, 198-210.

48. Qian, W., Yang, J-R., Pearson, N.M., Maclean, C. and
Zhang, J. 2012, Balanced codon usage optimizes
eukaryotic translational efficiency, PLoS Genet., 8,
e1002603.

49. Sorensen, M.A. and Pedersen, S. 1991, Absolute in vivo
translation rates of individual codons in Escherichia
coli: the two glutamic acid codons GAA and GAG are
translated with a threefold difference in rate, J. Mol.
Biol., 222, 265-80.

50. Kudla, G., Murray, A.W., Tollervey, D. and Plotkin, J.B.
2009, Coding-sequence determinants of gene
expression in Escherichia coli, Science, 324, 255-58.

51. Audic, S. and Claverie, J.M. 1997, The significance of
digital gene expression profiles, Genome Res., 7,
986-95.

52. Munoz, E.T., Bogarad, L.D. and Deem, M.W. 2004,
Microarray and EST database estimates of mRNA
expression levels differ: the protein length versus
expression curve for C. elegans, BMC Genomics, 5, 30.

53. Cutter, A.D., Wasmuth, J.D. and Washington, N.L. 2008,
Patterns of molecular evolution in Caenorhabditis
preclude ancient origins of selfing, Genetics, 178,
2093-104.

54. Ingvarsson, P.K. 2008, Molecular evolution of
synonymous codon usage in Populus, BMC Evol. Biol., 8, 307.

55. Whittle, C.A., Sun, Y. and Johannesson, H. 2011,
Evolution of synonymous codon usage in Neurospora
tetrasperma and Neurospora discreta, Genome Biol. Evol.,
3, 332-43.






56. Knight, R.D., Freeland, S.J. and Landweber, L.F. 2001, A
simple model based on mutation and selection explains
trends in codon and amino acid usage and GC
composition within and across genomes, Genome Biol., 2,
research0010.

57. Mitreva, M., Wendl, M.C., Martin, J., et al. 2006, Codon
usage patterns in Nematoda: analysis based on over
25 million codons in thirty-two species, Genome Biol.,
7, R75.

58. Palidwor, G.A., Perkins, T.J. and Xia, X. 2010, A general
model of codon bias due to GC mutational bias, PLoS
ONE, 5, e13431.

59. Hiwasa-Tanase, K., Nyarubona, M., Hirai, T., Kato, K.,
Ichikawa, T. and Ezura, H. 2011, High-level accumulation
of recombinant miraculin protein in transgenic
tomatoes expressing a synthetic miraculin gene with
optimized codon usage terminated by the native
miraculin terminator, Plant Cell Rep., 30, 113-24.

60. Mendez, R., Fritsche, M., Porto, M. and Bastolla, U. 2010,
Mutation bias favors protein folding stability in the
evolution of small populations, PLoS Comput. Biol., 6,
e1000767.

61. Jiang, N., Ferguson, A.A., Slotkin, R.K. and Lisch, D.
2011, Pack-mutator-like transposable elements
(Pack-MULEs) induce directional modification of genes
through biased insertion and DNA acquisition, Proc.
Natl. Acad. Sci. USA, 108, 1537-42.

62. Stayssman, R., Nejman, D., Roberts, D., et al. 2009,
Developmental programming of CpG island methylation
profiles in the human genome, Nat. Struct. Mol.
Biol., 16, 564-71.

63. Gupta, A.B. and Sankararamakrishnan, R. 2009,
Genome-wide analysis of major intrinsic proteins in
the tree plant Populus trichocarpa: characterization of
XIP subfamily of aquaporins from evolutionary
perspective, BMC Plant Biol., 9, 134.



